The data set we will be analyzing is a handwriting recognition data set: each observation corresponds to an 8 by 8 grid of pixel intensities. (We'll see below that we can recreate the image using the matplotlib imshow() function.) The intention is to map the 64 input variables (corresponding to pixel intensities on the 8x8 grid) to a categorical system response quantity (the digits 0 through 9). Principal component analysis does not make these types of mappings or predictions, however; PCA will project the inputs from 64-dimensional space onto a smaller space, thus reducing the cost of the problem and opening up more machine learning approaches.
Principal components analysis is a mapping from the original, high-dimensional space of $d$ input variables to a lower-dimensional subspace of $k$ variables. The intention is to minimize the loss of information: to maximize the amount of variance in the data explained by the principal components, while minimizing the number of principal components.
It is important to note that PCA is a dimensionality reduction technique; it does not use the outputs. PCA is used to lower the dimensionality, and is thus combined with other machine-learning techniques that can use the lower-dimensional set of input variables.
In linear algebra terms, we are projecting an input vector $\mathbf{X}$ to a reduced input vector $\mathbf{Z}$ using a projection matrix $\mathbf{W}$:
$$ \mathbf{Z} = \mathbf{W}^{T} \mathbf{X} $$
The construction of the projection matrix is based on a maximization problem: maximizing the variance explained by each principal component. The first principal component is $\mathbf{W}_1$, a vector of length $d$: a weighted combination of input variables which, after the data is projected onto $\mathbf{W}_1$, spreads the data out the most.
The optimization problem thus becomes: for each principal component $i$, maximize the variance of the projected component $\mathbf{Z}_i = \mathbf{W}_i^{T} \mathbf{X}$, which is equal to:
$$ Var(\mathbf{Z}_i) = \mathbf{W}_i^{T} \Sigma \mathbf{W}_i $$
while constraining the L2 norm of each projection vector to be 1,
$$ \mathbf{W}_i^{T} \mathbf{W}_i = 1 $$
This boils down to a few linear algebra operations, which we'll do by hand, and then with scikit-learn's PCA object.
%matplotlib inline
# numbers
import numpy as np
import pandas as pd
# stats
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
# plots
import matplotlib.pyplot as plt
import seaborn as sns
# utils
import os, re
from pprint import pprint
Start by looking at the two files, .tes and .tra, containing test and training data, respectively:
print [j for j in os.listdir('data/optdigits')]
Each file has 65 columns: 64 inputs (consisting of the optical image representing the hand-written character) and 1 response (which digit the hand-written character represents). Split both testing and training data into input matrix $\mathbf{X}$ and response vector $\mathbf{y}$. We'll only use the training data - we're not actually training any models here, just reducing dimensionality of the problem.
## No learning, no testing, yet.
#testing_df = pd.read_csv('data/optdigits/optdigits.tes',header=None)
#X_testing, y_testing = testing_df.loc[:,0:63], testing_df.loc[:,64]
training_df = pd.read_csv('data/optdigits/optdigits.tra',header=None)
X_training, y_training = training_df.loc[:,0:63], training_df.loc[:,64]
print X_training.shape
print y_training.shape
The 64 inputs consist of an 8x8 image, which we can actually look at ourselves using matplotlib's imshow() function:
mat = X_training.loc[0,:].values.reshape(8,8)
print mat
plt.imshow(mat)
plt.show()
Our input data is a bit messy - 8x8 grids of integer values with a non-standard scale. It is useful to scale the input variables, using a scikit-learn StandardScaler object. This will scale the variables so that they have a zero mean and a standard deviation of approximately 1.
Here is a function that takes a matrix of input variables as a parameter, and returns the standardized input matrix, the column means of the standardized inputs, and their covariance matrix:
def get_normed_mean_cov(X):
    X_std = StandardScaler().fit_transform(X)
    X_mean = np.mean(X_std, axis=0)
    ## Automatic:
    #X_cov = np.cov(X_std.T)
    # Manual:
    X_cov = (X_std - X_mean).T.dot((X_std - X_mean)) / (X_std.shape[0]-1)
    return X_std, X_mean, X_cov
X_std, X_mean, X_cov = get_normed_mean_cov(X_training)
Note that by scaling our input variables, we save ourselves a little work when we project the high-dimensional inputs to their corresponding low-dimensional PCA subspace.
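As a quick sanity check, we can confirm that the standardized inputs are already centered, so no extra mean subtraction is needed before projecting onto the principal components:
# Illustrative check: StandardScaler has already centered every column,
# so the column means of X_std should be numerically zero.
print("Largest column mean (absolute value): %0.3g" % np.abs(X_std.mean(axis=0)).max())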
Now that we have the covariance matrix, we can visualize it. This tells us about the covariance between different pixels in the 8x8 images. We can visually see the structure of the 8x8 images - pixels on opposite ends of the image have negative covariances, which means they are likely to be on opposite sides of the above-average/below-average scale. Recall the definition of covariance:
$$ Cov(\psi,\phi) = \sum_{i=1}^{N} \dfrac{ \left( \psi_i - \overline{\psi} \right) \left( \phi_i - \overline{\phi} \right) }{ N } $$
fig = plt.figure(figsize=(12,12))
sns.heatmap(pd.DataFrame(X_cov), annot=False, cmap='PuOr')
plt.show()
The problem of finding the principal components was expressed above as an optimization problem. This optimization problem can be converted into a Lagrange multipliers problem, and solved by computing the eigenvalues and eigenvectors of the covariance matrix.
Thus, the eigenvalues and eigenvectors of the covariance matrix yield the principal components.
No good engineering mathematics course is complete without a dash of eigenvectors and eigenvalues - enough to make the students want to tear their hair out, and no less. The topic is useful in studying the behavior of systems, as the eigenvalues represent the characteristic values of a matrix system, and can be used in many different ways. The eigenvectors and eigenvalues of the covariance matrix will give the principal components and a vector that we can use to project high-dimensional inputs to the lower-dimensional subspace.
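To spell out that Lagrange multipliers step (a sketch of the standard argument, using the notation above): maximize the projected variance subject to the unit-norm constraint, and set the gradient of the Lagrangian to zero.
$$ \max_{\mathbf{W}_1} \; \mathbf{W}_1^{T} \Sigma \mathbf{W}_1 - \lambda \left( \mathbf{W}_1^{T} \mathbf{W}_1 - 1 \right) \quad \Rightarrow \quad 2 \Sigma \mathbf{W}_1 - 2 \lambda \mathbf{W}_1 = 0 \quad \Rightarrow \quad \Sigma \mathbf{W}_1 = \lambda \mathbf{W}_1 $$
So $\mathbf{W}_1$ must be an eigenvector of the covariance matrix, and because the projected variance is $\mathbf{W}_1^{T} \Sigma \mathbf{W}_1 = \lambda$, the eigenvector with the largest eigenvalue gives the first principal component; the remaining components follow from the same argument with an added orthogonality constraint.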
eigenvals, eigenvecs = np.linalg.eig(X_cov)
# Eigenvalues are not necessarily sorted, but eigenval[i] *does* correspond to eigenvec[i]
print "Eigenvals shape: "+str(eigenvals.shape)
print "Eigenvecs shape: "+str(eigenvecs.shape)
# Create a tuple of (eigenvalues, eigenvectors)
unsrt_eigenvalvec = [(np.abs(eigenvals[i]), eigenvecs[:,i]) for i in range(len(eigenvals))]
# Sort tuple by eigenvalues
eigenvalvec = sorted(unsrt_eigenvalvec, reverse=True, key=lambda x:x[0])
#pprint([pair for pair in eigenvalvec])
We can visualize the eigenvectors, sorted by eigenvalue rank, and visually identify which eigenvector components dominate in each eigenvalue. This is essentially a visualization of what information the principal component analysis has judged most important.
Note we're doing this all a bit on-the-fly - a list comprehension is looping over the first 21 eigenvalue-eigenvector pairs, which have been sorted above, and grabbing the eigenvector (pair[1]). This is all being stuffed into a DataFrame, and everything slapped together for a nice fancy visualization. This is mainly because we have no particularly meaningful labels for the 64 input components.
fig = plt.figure(figsize=(14,3))
sns.heatmap(pd.DataFrame([pair[1] for pair in eigenvalvec[0:21]]),
annot=False, cmap='coolwarm',
vmin=-0.5,vmax=0.5)
plt.ylabel("Ranked Eigenvalue")
plt.xlabel("Eigenvector Components")
plt.show()
More of the cyclic structure of the images is present in the above heatmap. The spots of high and low values indicate there may be a few pixels whose presence or absence is a strong indicator that a number is or is not a certain digit. For example, the digit 8 should have holes in the top and bottom. Supposing the 8 is legible, there are only one or two pixels where the holes can go, so the presence or absence of color there can be a strong indicator for a single digit.
To measure the "principalness" of each principal component - how important it is - we can determine how much of the variance in the data it explains. (Note that we say variance here because representing the data using the mean is considered the baseline against which everything else is compared; this is the same idea behind Pearson's R squared coefficient.) If more variance - variation from the mean - in the data can be explained by the model, the model is better.
The individual explained variance of principal component $k$ is denoted $\lambda_{k}$.
The cumulative proportion of variance explained by the first $k$ principal components (out of $d$ total) is given by:
$$ \frac{ \lambda_1 + \lambda_2 + \dots + \lambda_k }{ \lambda_1 + \lambda_2 + \dots + \lambda_{k} + \dots + \lambda_{d}} $$
We'll start by calculating the individual explained variance ratios, $\lambda_{k} / \sum_{j=1}^{d} \lambda_{j}$.
lam_sum = sum(eigenvals)
explained_variance = [(lam_k/lam_sum) for lam_k in sorted(eigenvals, reverse=True)]
The distribution of explained variance for each principal component gives a sense of how much information will be represented and how much lost when the full, 64-dimensional input is reduced using a principal component model (i.e., a model that utilizes only the first $N$ principal components).
plt.figure(figsize=(6, 4))
plt.bar(range(len(explained_variance)), explained_variance, alpha=0.5, align='center',
label='Individual Explained Variance $\lambda_{k}$')
plt.ylabel('Explained variance ratio')
plt.xlabel('Ranked Principal Components')
plt.title("Scree Graph")
plt.legend(loc='best')
plt.tight_layout()
fig = plt.figure(figsize=(6,4))
ax1 = fig.add_subplot(111)
ax1.plot(np.cumsum(explained_variance))
ax1.set_ylim([0,1.0])
ax1.set_xlabel('Number of Principal Components')
ax1.set_ylabel('Cumulative explained variance')
ax1.set_title('Explained Variance')
plt.show()
Our intention with the principal component analysis is to reduce the high-dimensional input to a low-dimensional input. Ultimately that low-dimensional input is intended for use in a model, which we'll train using the training data and test using the testing data.
We'll start by specifying how many of the 64 principal components we want our model to keep. Adding more components increases both the cost and the accuracy. Start with a few, and go from there.
# This is a little easier to do by just using the PCA class from scikit-learn.
# But that isn't as fun as doing everything by hand.
sklearn_pca = PCA(n_components=4).fit(X_std)
Once the PCA model has been fit to the inputs, we can obtain various quantities to see what the four principal components look like. For example, to see the contributions of each of the 64 input variables to each principal component, use the components_ field of the PCA object:
print sklearn_pca.components_.shape
print "Principal Components:"
print sklearn_pca.components_
Likewise a bar plot of the components can show which variables have the heaviest weights in which principal component:
colors = [sns.xkcd_rgb[z] for z in ['dusty purple','dusty green','dusty blue','orange']]
for i in range(4):
    fig = plt.figure(figsize=(14,4))
    sns.barplot(range(len(sklearn_pca.components_[i])), sklearn_pca.components_[i], color=colors[i])
    plt.xlabel('Component Number')
    plt.ylabel('Principal Component '+str(i+1)+' Value')
    plt.title('Principal Component '+str(i+1))
    plt.show()
Turning the principal component vector from a vector of length 64 to an 8x8 matrix reveals an interesting visual representation of the principal components:
fig = plt.figure(figsize=(10,8))
axes = [fig.add_subplot(220+i+1) for i in range(4)]
for i,ax in enumerate(axes):
    sns.heatmap(sklearn_pca.components_[i].reshape(8,8),
                square=True, ax=ax,
                vmax=0.30, vmin=-0.30)
    ax.set_title('Principal Component '+str(i))
    #ax.set_xlabel('Horz Pixels')
    #ax.set_ylabel('Vert Pixels')
plt.show()
Because each component of these vectors corresponds to a pixel in the image, the above heatmaps show a visual map of which pixels are important in which principal components.
We know right now that 4 components is going to give an extremely crummy model. That's like saying we could get the gist of a book by picking out 3 words from each page. But how bad will it be? Surprisingly, it would explain almost 50% of the variance. From the explained variance vector:
print np.cumsum(explained_variance)[4]
Here's a list comprehension that loops over each element in explained variance, and creates a table of how many components are needed to explain a certain amount of the variance. This table can be used to pick a number of components based on an amount of variance explained.
If we select N principal components, how much of the variance will be explained?
pprint([(j, np.cumsum(explained_variance)[j]) for j in range(len(explained_variance[:33]))])
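Going the other direction - if we have a target level of explained variance in mind, we can find the smallest number of components that reaches it. A minimal sketch (the 90% threshold is just an arbitrary example):
# Smallest number of components whose cumulative explained variance
# reaches a given threshold (0.90 is an arbitrary example).
cumulative = np.cumsum(explained_variance)
print("Components needed for 90%% explained variance: %d" % (np.searchsorted(cumulative, 0.90) + 1))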
To actually transform the high-dimensional 64-input points onto the lower-dimensional subspace of principal components (the new feature subspace), we weight each of the input variables according to the eigenvectors. This is done using a projection matrix - a matrix that multiplies the input vector to map it into the smaller subspace.
For example, to transform a set of 64-dimensional input vectors $\mathbf{X}$ into their approximated 4-dimensional subspace versions $\mathbf{Z}$ using the projection matrix $\mathbf{W}$, we'll need to do the following matrix multiplication:
$$ \mathbf{Z} = \mathbf{W}^{T} \mathbf{X} $$
where $\mathbf{W}^{T}$ is a $4 \times 64$ projection matrix, and $\mathbf{X}$ is a $64 \times N$ matrix of the original inputs (this is the transpose of what we have stored in the X_std variable).
This will result in a matrix $\mathbf{Z}$ that is of size $4 \times N$. This matrix can then be used as a reduced set of input variables, and a variety of other machine learning techniques can be applied. The expense of these techniques often increases as $O(d^2)$ or $O(d \log d)$ in the number of input dimensions $d$, so reducing 64 dimensions to 4 can lead to orders-of-magnitude decreases in the computational cost of machine learning models.
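To make that concrete, here is a minimal sketch of how the reduced inputs could feed a downstream model - we are not training or testing anything in this notebook, and the choice of a k-nearest-neighbors classifier here is arbitrary, purely for illustration:
# Illustration only -- no training or testing is done in this notebook.
# A PCA step can be chained with any downstream estimator; k-nearest
# neighbors is just an arbitrary example.
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
reduced_model = Pipeline([('reduce', PCA(n_components=4)),
                          ('classify', KNeighborsClassifier())])
# reduced_model.fit(X_std, y_training) would fit the classifier on the
# 4-dimensional projections rather than on all 64 pixels.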
To use our own eigenvectors/eigenvalues to create the projection matrix:
# Do it by hand first (we'll also see how to do this with the scikit-learn PCA object):
matW = np.hstack([pair[1].reshape(64,1) for pair in eigenvalvec[0:4]])
print matW.shape
# There are two ways to use the projection matrix:
# This is the "easy" way.
# This will make our projection reversed from scikit-learn,
# (flipped across one axis).
Z = X_std.dot(matW)
# This is the "correct" way.
# This projection will match scikit-learn's,
# But it also transposes Z (awkward).
#Z = matW.T.dot(X_std.T)
print Z.shape
# To color each point by the digit it represents,
# create a color map with 10 elements (10 RGB values).
# Then, use the system response (y_training), which conveniently
# is a digit from 0 to 9.
def get_cmap(n):
    #colorz = plt.cm.cool
    colorz = plt.get_cmap('Set1')
    return [ colorz(float(i)/n) for i in range(n)]
colorz = get_cmap(10)
colors = [colorz[yy] for yy in y_training]
When we determine the explained variance due to each principal component in a 4-component PCA, we have to normalize by the total variance of the four components, not by the 64-dimensional explained variance calculated above. We account for this by computing the total variance as the sum of the explained variance of the first 4 components:
total_variance = np.sum(explained_variance[0:n_components])
which makes the explained variances of each of the four components add up to 1. This also keeps us consistent with the definition of explained variance for the PCA model we have selected, and keeps us consistent with the scikit-learn implementation.
fig = plt.figure(figsize=(14,6))
ax1, ax2 = [fig.add_subplot(120 + i + 1) for i in range(2)]
ax1.scatter( Z[:,0], Z[:,1] , c=colors )
ax1.set_title('Principal Components 1 and 2\nSubspace Projection')
ax2.scatter( Z[:,2], Z[:,3] , c=colors )
ax2.set_title('Principal Components 3 and 4\nSubspace Projection')
plt.legend()
plt.show()
n_components = 4
total_variance = np.sum(explained_variance[0:n_components])
for i in range(n_components):
    print("Explained Variance, Principal Component %d: %0.4f"%(i,explained_variance[i]/total_variance))
pairplot_df = pd.DataFrame(Z, columns=['Principal Component '+str(j) for j in range(Z.shape[1])])
pairplot_df = pairplot_df.reindex(columns=pairplot_df.columns.sort_values(ascending=True))
z_columns = pairplot_df.columns
pairplot_df['Cluster'] = y_training
pairplot_df = pairplot_df.sort_values('Cluster',ascending=True)
sns.pairplot(pairplot_df, hue='Cluster',
vars=z_columns, # don't plot the category/system response
palette='Set1')
plt.show()
How do we interpret this plot?
First, it's a pleasant surprise that even with a measly four principal components, the various points corresponding to various digits still cluster nicely - in some data sets, this is not so clear. The locations of these points compose a probability density function, which could be combined with probabilistic methods to produce a prediction of what digit a point comes from based on where it is located.
The unfortunate thing about the plot is that the clusters are sitting on top of one another. More principal components, or other clustering techniques, can further differentiate these points, but with only four principal components, the explained variance is not high enough to segregate the clusters further.
We can repeat the same procedure with scikit-learn, projecting a set of high-dimensional inputs $\mathbf{X}$ to the low-dimensional subspace, yielding $\mathbf{Z}$:
# Now repeat the same process of projecting all
# the high-dimensional inputs into a lower-dimensional
# subspace, but using the scikit-learn PCA object:
scikit_Z = sklearn_pca.fit_transform(X_std)
print scikit_Z.shape
print Z.shape
The plots are identical, other than being reflected across the x-axis. (This was caused by arranging our data in transposed order - easier to visualize and manipulate - rather than the traditional linear-algebra order.) Furthermore, the explained variances are identical as well:
fig = plt.figure(figsize=(14,6))
ax1, ax2 = [fig.add_subplot(120 + i + 1) for i in range(2)]
# To color each point by the digit it represents,
# create a color map with 10 elements (10 RGB values).
# Then, use the system response (y_training), which conveniently
# is a digit from 0 to 9.
def get_cmap(n):
    #colorz = plt.cm.cool
    colorz = plt.get_cmap('Set1')
    return [ colorz(float(i)/n) for i in range(n)]
colorz = get_cmap(10)
colors = [colorz[yy] for yy in y_training]
ax1.scatter( scikit_Z[:,0], scikit_Z[:,1] , c=colors )
ax1.set_title('Principal Components 0 and 1\nSubspace Projection (scikit-learn)')
ax2.scatter( scikit_Z[:,2], scikit_Z[:,3] , c=colors )
ax2.set_title('Principal Components 2 and 3\nSubspace Projection (scikit-learn)')
plt.show()
for i in range(4):
    print("Explained Variance, Principal Component %d: %0.4f"%(i,sklearn_pca.explained_variance_[i]/sklearn_pca.explained_variance_.sum()))
We've projected the original high-dimensional inputs into a lower-dimensional subspace, using the projection matrix:
$$ \mathbf{Z} = \mathbf{W}^{T} \mathbf{X} $$
It is possible to assess how good the model is by using back-projection, to transform the low-dimensional inputs back into high-dimensional space, with some loss of information. The back-projected set of high-dimensional inputs is denoted $\hat{\mathbf{X}}$:
$$ \hat{\mathbf{X}} = \mathbf{W} \mathbf{Z} $$
Note that this utilizes the fact that $\mathbf{W}^{T} \mathbf{W} = \mathbf{I}$, which follows from the optimization constraints. The Euclidean distance between the original and the back-projected points provides a measure of how much information was lost:
$$ \| \hat{\mathbf{X}} - \mathbf{X} \|^2 $$
In fact, by using PCA, we are picking principal components such that each additional principal component will minimize this quantity. (In the code below, where observations are stored as rows, the same back-projection is computed as $\hat{\mathbf{X}} = \mathbf{Z} \mathbf{W}^{T}$.)
By visually comparing the original and back-projected inputs, we can visually examine what information is lost, and how much, by using imshow() to show the original and back-projected data.
X_digit1 = X_std[0]
Z_digit1 = X_digit1.dot(matW)
Xhat_digit1 = Z_digit1.dot(matW.T)
fig = plt.figure(figsize=(10,6))
ax1, ax2, ax3, ax4 = (fig.add_subplot(221+i) for i in range(4))
ax1.imshow(X_digit1.reshape(8,8))
ax1.set_title('Original Image')
ax2.imshow(Xhat_digit1.reshape(8,8))
ax2.set_title('Back-Projected Image')
ax3.imshow(X_digit1.reshape(8,8),
interpolation = "none")
ax3.set_title('Original Image (No Interp)')
ax4.imshow(Xhat_digit1.reshape(8,8),
interpolation = "none")
ax4.set_title('Back-Projected Image (No Interp)')
plt.show()
for j in range(5):
    X_digit = X_std[j+20]
    Z_digit = X_digit.dot(matW)
    Xhat_digit = Z_digit.dot(matW.T)
    fig = plt.figure(figsize=(10,4))
    ax1, ax2 = (fig.add_subplot(121+i) for i in range(2))
    ax1.imshow(X_digit.reshape(8,8),
               cmap="bone_r")
    ax1.set_title('Original Image')
    ax2.imshow(Xhat_digit.reshape(8,8),
               cmap="bone_r")
    ax2.set_title('Back-Projected Image')
    plt.show()
X_digit1 = X_std[0]
Z_digit1 = X_digit1.dot(matW)
Xhat_digit1 = Z_digit1.dot(matW.T)
mse = ((Xhat_digit1 - X_digit1)**2).mean(axis=None)
print mse
# use the scikit pca object
X_hat = sklearn_pca.inverse_transform( sklearn_pca.transform(X_std) )
mse = ((X_hat - X_std)**2).mean(axis=None)
print mse
print 1-mse
This is quite large, considering all values have been scaled to have a mean of 0 and variance of 1. This is essentially equivalent to an $R^2$ value of 0.375 ($R^2 = 1 - MSE$). Yikes!
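For the record, the identity being leaned on here is the following (a sketch, using the fact that standardization gives each non-constant input unit variance):
$$ R^2 = 1 - \frac{ \| \hat{\mathbf{X}} - \mathbf{X} \|^{2} }{ \| \mathbf{X} - \overline{\mathbf{X}} \|^{2} } = 1 - \frac{MSE}{Var} \approx 1 - MSE $$
since the per-element variance of the standardized inputs is (approximately) 1.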
print np.cumsum(explained_variance)[:4]
def pca_mse(n):
    # Make PCA
    pca = PCA(n_components=n, whiten=True)
    pca.fit(X_std)
    X_red = pca.transform(X_std)
    # MSE associated with back-projection:
    X_orig = X_std
    X_hat = pca.inverse_transform( pca.transform(X_orig) )
    mse = ((X_hat - X_orig)**2).mean(axis=None)
    return mse
mses = []
N = 33
for i in range(1,N):
    m = pca_mse(i)
    print "%d-component PCA: MSE = %0.4g"%(i,m)
    mses.append((i,m))
mses = np.array(mses)
plt.plot(mses[:,0],mses[:,1],'-o',label='MSE')
plt.plot(mses[:,0],1.0-mses[:,1],'-o',label='Rsq')
plt.title('Mean Square Error, Principal Components Analysis')
plt.xlabel('Number of Principal Components')
plt.ylabel('Mean Square Error')
plt.legend(loc='best')
plt.show()
plt.plot(range(1,len(explained_variance)+1),np.cumsum(explained_variance),
'-o',label='Expl Var')
plt.plot(mses[:,0],1.0-mses[:,1],'-o',label='1-MSE')
plt.title('(1-MSE) versus Explained Variance')
plt.xlabel('Number of Principal Components')
plt.ylabel('Mean Square Error/Explained Variance')
plt.legend(loc='best')
plt.show()
While the values of 1 - MSE differ slightly from the explained variance for the first few principal components, the two values are very close. Because explained variance is the more widely used metric and is easier to compute, we'd do well to stick with that.
(Future work)
In this section we will apply the principal component analysis to each class individually. This will use principal components common to all digits, to ease comparison, but use different covariances for each different class of digits.
This notebook showed some of the implementation details of a PCA model, demonstrated how to visualize the principal components and the subspace projection, and explored various ways of visualizing the results, including the use of back-projection for a qualitative sense of the information lost through the principal component transform.
Future work: Further variations are possible, particularly for classification data, such as the NIST digit classification data set. Karhunen-Loève expansion uses class information, for example, by computing separate class covariance matrices and averaging them. Common principal components finds principal components that are common to each class, but uses different covariances for different classes. This allows for regularization - forcing a model away from overfitting data.
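As a rough, illustrative sketch of the per-class covariance idea (this is not a full Karhunen-Loève or common-principal-components implementation), the class covariances could be computed and pooled like so:
# Illustrative sketch: per-class covariance matrices of the standardized
# inputs, pooled with a class-size-weighted average.
class_covs = {}
for digit in range(10):
    X_class = X_std[(y_training == digit).values]
    class_covs[digit] = np.cov(X_class.T)
pooled_cov = sum((y_training == digit).sum() * class_covs[digit]
                 for digit in range(10)) / float(len(y_training))
print("Pooled covariance shape: %s" % str(pooled_cov.shape))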